A simple aggregation you can do to get metrics at an hourly level:
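A minimal sketch of such an aggregation, assuming a DataFrame with a `timestamp` column and a numeric `value` column (both names are placeholders, and the data here is synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic minute-level records; column names are illustrative
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=1000, freq="min"),
    "value": np.random.rand(1000),
})

# Roll the records up to hourly metrics
hourly = (
    df.set_index("timestamp")
      .resample("h")
      .agg({"value": ["sum", "mean", "count"]})
)
```

`resample("h")` buckets the records into hourly bins keyed on the index, and `agg` computes each requested metric per bin.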

It is also important to account for missing records: you may want to fill in 0 values where no records exist, or impute using the previous or next time step. I removed the records for hour 15 to show how you can use the hour 14 timestamp to impute the missing value:
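One way to do this is to reindex onto the full hourly index and then forward-fill; a sketch with synthetic data, dropping hour 15 as in the example above:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=24, freq="h")
df = pd.DataFrame({"metric": np.arange(24.0)}, index=idx)
df = df.drop(idx[15])  # simulate the missing hour-15 record

# Reindexing restores hour 15 as NaN; ffill then copies hour 14's value into it
full = df.reindex(idx)
imputed = full.ffill()   # impute from the previous time step
zeros = full.fillna(0)   # or insert 0 where no records existed
```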

Fast visualization using Plotly Express

Speed up pandas apply() via Swifter

I sometimes run into long wait times when processing pandas columns, even when running code on a notebook backed by a large instance. Fortunately, there is an easy one-word addition that speeds up the apply functionality on a pandas DataFrame: you only have to import the swifter library.
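A sketch of the one-word change, with a plain-apply fallback in case swifter is not installed (the transformation function is just an illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.rand(10_000)})

def transform(x):
    # Stand-in for an expensive element-wise computation
    return x ** 2 + 1

try:
    import swifter  # noqa: F401 -- importing registers the .swifter accessor
    result = df["value"].swifter.apply(transform)
except ImportError:
    result = df["value"].apply(transform)
```

swifter decides at runtime whether to vectorize, parallelize, or fall back to a plain apply, so the actual speedup depends on the workload.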

We are able to reduce the processing time by about 36%, from 13 minutes 38 seconds to 8 minutes 45 seconds (the new runtime is roughly 64% of the original).

Multiprocessing using Python

While we are on the topic of reducing runtime, I often end up dealing with datasets that I wish to process at multiple granularities. Using multiprocessing in Python helps me save time by utilizing multiple workers.

I demonstrate the effectiveness of multiprocessing using the same 50-million-row DataFrame I created above, except this time I add a categorical variable whose value is randomly selected from a set of vowels.

I compare a plain for loop against ProcessPoolExecutor from concurrent.futures to demonstrate the runtime reduction we can achieve.

We see a reduction of CPU time by 99.3%. One must remember to use these methods carefully, however: the workers do not serialize their output, so partitioning the work by group is a good way to leverage this capability.

MASE as a metric

With the rise of Machine Learning and Deep Learning approaches for time series forecasting, it is essential to use a metric that is NOT based solely on the distance between predicted and actual values. A forecasting metric should also use errors from the temporal trend to evaluate how well a model is performing, instead of relying only on point-in-time error estimates. Enter Mean Absolute Scaled Error! This metric takes into account the error we would get from a random walk approach, where the last timestamp's value is the forecast for the next timestamp. It compares the error from the model to the error from that naive forecast.
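A sketch of the computation (the helper name and the toy series are mine):

```python
import numpy as np

def mase(actual, predicted):
    # Denominator: mean absolute error of the naive one-step forecast,
    # where each value is predicted by the previous timestamp's value
    naive_error = np.mean(np.abs(np.diff(actual)))
    # Numerator: mean absolute error of the model's forecasts
    model_error = np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))
    return model_error / naive_error

actual = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
predicted = np.array([10.0, 11.0, 11.0, 12.0, 14.0])
score = mase(actual, predicted)  # < 1 here: the model beats the naive forecast
```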

If MASE > 1 then the model is performing worse than a random walk. The closer the MASE is to 0, the better the forecasting model.